Final Project: NHL Shot Statistics, The Effect of Defending Ice Time on Expected Goals and Shot Outcomes

Author

Connor Raney

Published

April 28, 2024

Introduction

For my final project, I will be working alone on the topic of hockey data. There is a wide variety of hockey data, but some of the most interesting to me is shot data. Every shot taken over the past 15-20 years has been tracked with more than 120 variables per shot, giving as much context as possible to characterize each kind of shot. Why is this data important to study? Shot data is some of the most important data in hockey because it is the base of most modern hockey statistics. Shot data is used to calculate xGoals (expected goals), which is one of the main drivers for determining:
1. The performance of a given team
2. The performance of a given team’s goalie

Using a model, the probability of each shot being a goal is calculated from factors such as the distance from the net, the angle of the shot, the type of shot, and what happened before the shot, among others. By adding up the probabilities of a team’s shots during a game, you can calculate the team’s expected goals and essentially measure the amount of offense it generated. This can also show whether a team got ‘unlucky’ or ‘goalied’ (beaten by the other team’s goalie in a game it could easily have won had it scored anywhere near its xG) or whether it simply did not generate enough offense. It can also be used to evaluate every goalie in the league, as the main measure of the ‘best’ goalie each season is goals saved above expected (xG – Actual Goals). As you can see, almost all team statistics are built on xG, which is in turn built on the shot data collected.
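To make the arithmetic concrete, here is a minimal sketch in base R on a made-up four-shot game. The data frame shots_toy and its values are invented for illustration; only the column names teamCode, xGoal, and goal come from the Moneypuck data dictionary.

```r
# Toy data: four shots from one game (values are made up for illustration)
shots_toy <- data.frame(
  teamCode = c("TOR", "TOR", "MTL", "MTL"),
  xGoal    = c(0.10, 0.30, 0.05, 0.25),
  goal     = c(0, 1, 0, 0)
)

# Team expected goals: the sum of the per-shot goal probabilities,
# shown next to the goals each team actually scored
team_totals <- aggregate(cbind(xGoal, goal) ~ teamCode, data = shots_toy, FUN = sum)

# Goals saved above expected (xG - actual goals), credited to the goalie
# who faced that team's shots
team_totals$goalsSavedAboveExpected <- team_totals$xGoal - team_totals$goal

print(team_totals)
```

In this toy game the goalie facing TOR allowed 1 goal on 0.40 xG, so that goalie's goals saved above expected is negative, while the goalie facing MTL stopped everything and comes out ahead.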

In addition to this, the Moneypuck model, among other models, scores each shot for xGoal, xFroze (the puck is stopped by the goalie and the whistle is blown), xRebound (the shot creates a rebound), xPlayContinuedInZone (play continues inside the zone), xPlayContinuedOutsideZone (play continues outside the zone), and xPlayStopped (play is stopped for another reason, e.g. the puck goes into the netting and out of play). These six probabilities total 1, as they cover all the possible outcomes of a shot. With this background on shot data and xGoals, you can see why shot data is so important in hockey: it is used not only for these calculations, but for much more. Moneypuck, a hockey statistics and prediction website, has publicly accessible shot data totaling 1,717,746 shots from the 2007-2008 through 2022-2023 seasons, in addition to almost 100,000 shots from this season so far. They also provide a full description of all the variables in a CSV data dictionary, which I will attach. Moneypuck compiles its shot data from the ‘semi-public’ NHL API, ESPN, and other sources.
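As a quick sanity check on the claim that the six outcome probabilities total 1, a sketch like the following could be run against the data. The two rows in outcomes_toy are invented values; the column names are the ones listed in the data dictionary, and the tolerance of 1e-6 is an arbitrary choice on my part.

```r
# The six model scores that should cover every possible outcome of a shot
outcome_cols <- c("xGoal", "xFroze", "xRebound",
                  "xPlayContinuedInZone", "xPlayContinuedOutsideZone",
                  "xPlayStopped")

# Toy rows with made-up probabilities (not real Moneypuck values)
outcomes_toy <- data.frame(
  xGoal                     = c(0.08, 0.02),
  xFroze                    = c(0.20, 0.25),
  xRebound                  = c(0.07, 0.03),
  xPlayContinuedInZone      = c(0.35, 0.30),
  xPlayContinuedOutsideZone = c(0.25, 0.35),
  xPlayStopped              = c(0.05, 0.05)
)

# Row-wise total of the six probabilities
prob_total <- rowSums(outcomes_toy[outcome_cols])

# Allow for floating point rounding when comparing to 1
sums_to_one <- abs(prob_total - 1) < 1e-6
print(all(sums_to_one))
```

On the real data, any rows where sums_to_one is FALSE would be worth inspecting before modeling.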

The Question

Here is the research question I have constructed using the Moneypuck data:

Influence of Defensive Player Fatigue on Offensive Expected Goals: Does the amount of time the defensive team has collectively been on the ice affect the other (offensive) team’s generation of scoring chances, measured by expected goals (xGoals)?

Outcome Variable (Dependent Variable):
- xGoals: The expected goals value of each shot taken by the offensive team.

Treatment Variable (Independent Variable):
- defensiveTeamIceTime: A constructed variable representing the total amount of time the defensive team’s players have collectively spent on the ice up to the point of each shot taken by the offensive team. It can be calculated from the following variables: defendingTeamForwardsOnIce, defendingTeamDefencemenOnIce, defendingTeamAverageTimeOnIceOfForwards, and defendingTeamAverageTimeOnIceOfDefencemen.

Potential Confounders:
- shotDistance: The distance from the net at which the shot is taken.
- shotType: The type of shot (e.g., slap, wrist, backhand).
- shotAngle: The angle of the shot relative to the goal.
- speedFromLastEvent: The speed implied by the distance and time between the previous event and the shot.
- manAdvantageSituation: The man advantage situation (e.g., power play, even strength).
- defensiveTeamSkaters: The number of skaters on the ice for the defensive team.
- timeSinceLastEvent: The time elapsed since the last game event before the shot.

Potential Colliders:
- flurryAdjustedXGoals: The flurry adjusted expected goals value might be influenced by both the defensive team’s ice time (as it affects the likelihood of flurries) and the regular xGoals (as it is a modified version of xGoals).

We can also compare xGoals against actual shot outcomes in these cases to see whether more ice time is associated with more goals against than expected.
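One possible way to construct the treatment variable from the four columns named above is to weight each position group's average time on ice by the number of players in that group. This is only a sketch of that idea: the name defensiveTeamIceTime is my constructed variable, and the two toy rows are made-up values.

```r
# Toy shots with made-up values; column names follow the data dictionary
ice_toy <- data.frame(
  defendingTeamForwardsOnIce                = c(3, 2),
  defendingTeamDefencemenOnIce              = c(2, 2),
  defendingTeamAverageTimeOnIceOfForwards   = c(30, 50),
  defendingTeamAverageTimeOnIceOfDefencemen = c(45, 60)
)

# Collective ice time = (number of forwards * their average time on ice)
#                     + (number of defencemen * their average time on ice)
ice_toy$defensiveTeamIceTime <-
  ice_toy$defendingTeamForwardsOnIce * ice_toy$defendingTeamAverageTimeOnIceOfForwards +
  ice_toy$defendingTeamDefencemenOnIce * ice_toy$defendingTeamAverageTimeOnIceOfDefencemen

print(ice_toy$defensiveTeamIceTime)
```

Because this total grows with the number of skaters on the ice, it is not directly comparable between, say, 5v5 and a penalty kill, which is one reason a per-skater average may be the easier measure to interpret.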

Data

From Moneypuck, “All historical shot data is available to download. This includes 1,717,746 shots from the 2007-2008 to 2022-2023 seasons. Data for the 2023-2024 season is also available and updated nightly on this page. Saved shots on goal, missed shots, and goals are included. Blocked shots are not included in these datasets. There are 124 attributes for each shot, including everything from the player and goalie involved in the shot to angles, distances, what happened before the shot, and how long players had been on the ice when the shot was taken. Each shot also has model scores for its probability of being a goal (xGoals) as well as other models such as for the chance there will be a rebound after the shot, the probability the shot will miss the net, and whether the goalie will freeze the puck after the shot. The data has been collected from several sources including the NHL and ESPN. A good amount of data cleaning has also been done on the data. Arena adjusted shot coordinates and distances are also calculated in the dataset using the strategy War-On-Ice used from the method proposed by Schuckers and Curros.”

However, we will only be using the data from the 2022-2023 and 2023-2024 seasons, as the full dataset of 1.7 million shots is too large for the computing power that I have access to for this project.

The data was downloaded from Moneypuck with all shots as of 2024-04-25 14:45 Eastern Time. You can find the data at the following link: https://moneypuck.com/data.htm

Data Dictionary

Variable Definition
shotID Unique id for each shot
homeTeamCode The home team in the game. For example: TOR, MTL, NYR, etc
awayTeamCode The away team in the game
season Season the shot took place in. Example: 2009 for the 2009-2010 season
isPlayoffGame Set to 1 if a playoff game, otherwise 0
game_id The NHL Game_id of the game the shot took place in
homeTeamWon Set to 1 if the home team won the game. Otherwise 0.
id The event # of the shot in the game
time Seconds into the game of the shot
timeUntilNextEvent Time between the shot and the next event that happens in the game after the shot
timeSinceLastEvent Time between the shot and the event that took place before the shot
period Period of the game
team The team taking the shot. HOME or AWAY
location The zone the shot took place in. HOMEZONE, AWAYZONE, or Neu. Zone
event Whether the shot was a shot on goal (SHOT), goal (GOAL), or missed the net (MISS)
goal Set to 1 if shot was a goal. Otherwise 0
shotPlayContinuedOutsideZone Set to 1 if play continued after the shot (not a goal, goalie stop, or out of play), but the next event was outside of the attacking zone. Otherwise 0.
shotPlayContinuedInZone Set to 1 if play continued after the shot (not a goal, goalie stop, or out of play) and the next event was inside the attacking zone. Otherwise 0.
shotGoalieFroze Set to 1 if the goalie froze the puck within 1 second of the shot. Otherwise 0
shotPlayStopped Set to 1 if the play stopped after the shot for a reason beyond a goalie freeze. (Puck went outside the playing surface, dislodged net, etc). Otherwise 0
shotGeneratedRebound Set to 1 if the shot generated a rebound shot within 3 seconds of this shot.
homeTeamGoals Home team goals before the shot took place
awayTeamGoals Away team goals before the shot took place
xCord The X coordinate “North South” on the ice of the shot. Feet from red line. -89 and 89 are the goal lines at each end of the rink
yCord The Y coordinate “East West” on the ice of the shot. The middle of the ice has a y-coordinate of 0
xCordAdjusted Adjusts the x coordinate as if all shots were at the right end of the rink. Usually makes the coordinate a positive number
yCordAdjusted Adjusts the y coordinate as if all shots were at the right end of the rink.
shotAngle The angle of the shot in degrees. Is a positive number if the shot is from the left side of the ice.
shotAngleAdjusted The absolute value of the shot angle
shotAnglePlusRebound The difference in angle between the previous shot and this shot if this shot is a rebound. Is otherwise set to 0
shotAngleReboundRoyalRoad Set to 1 if the puck went through the middle of the ice between this shot and the previous shot if this shot is a rebound.
shotDistance The distance from the net of the shot in feet. Net is defined as being at the (89,0) coordinates
shotType Type of the shot. (Slap, Wrist, etc)
shotOnEmptyNet Set to 1 if the shot was on an empty net. Otherwise 0.
shotRebound Set to 1 if the shot is a rebound. (If the last event was a shot and within 3 seconds of this shot)
shotAnglePlusReboundSpeed The shotAnglePlusRebound variable divided by time between the last shot and this one. (How fast the angle changed)
shotRush Set to 1 if the shot was on a rush. (If the last event was in another zone and within 4 seconds)
speedFromLastEvent The distance between the shot location and the previous event’s location divided by the number of seconds between them
lastEventxCord The x coordinate of the last event before the shot
lastEventyCord The y coordinate of the last event before the shot
distanceFromLastEvent The distance between the shot location and the previous event’s location in feet
lastEventShotAngle The shot angle of the shot directly before this shot. (If the last event was a shot)
lastEventShotDistance The shot distance of the shot directly before this shot. (If the last event was a shot)
lastEventCategory The type of event before the shot. Shot, hit, etc.
lastEventTeam The team that did the last event. HOME or AWAY. If last event was a faceoff is the team that won the faceoff
homeEmptyNet Whether the home team’s net is empty at the time of the shot
awayEmptyNet Whether the away team’s net is empty at the time of the shot
homeSkatersOnIce The number of skaters on the ice for the home team. Does not count the goalie
awaySkatersOnIce The number of skaters on the ice for the away team. Does not count the goalie
awayPenalty1TimeLeft The number of seconds left in the penalty on the away team. Is the penalty that will expire first if multiple penalties
awayPenalty1Length The total length in seconds of the penalty on the away team. Is the penalty that will expire first if multiple penalties on the away team
homePenalty1TimeLeft The number of seconds left in the penalty on the home team. Is the penalty that will expire first if multiple penalties
homePenalty1Length The total length in seconds of the penalty on the home team. Is the penalty that will expire first if multiple penalties on the home team
playerPositionThatDidEvent The position of the player doing the shot. L for Left Wing, R for Right Wing, D for Defenceman, C for Centre.
playerNumThatDidEvent The jersey number of the player that took the shot
playerNumThatDidLastEvent The jersey number of the player that did the last event before the shot. Only populated if the previous event is a shot attempt. Otherwise 0.
lastEventxCord_adjusted Adjusts the last event’s x coordinate similar to the other adjusted coordinate variables
lastEventyCord_adjusted Adjusts the last event’s y coordinate similar to the other adjusted coordinate variables
timeSinceFaceoff Seconds since there has been a faceoff at the time of the shot
goalieIdForShot The NHL player id for the goalie the shot is on.
goalieNameForShot The First and Last name of the goalie the shot is on.
shooterPlayerId The NHL player id of the skater taking the shot
shooterName The First and Last name of the player taking the shot
shooterLeftRight Whether the shooter is a left or right shot. L/R
shooterTimeOnIce playing time in seconds that have passed since the shooter started their shift
shooterTimeOnIceSinceFaceoff The minimum of the playing time in seconds since the last faceoff and the playing time that has passed since the shooter started their shift
shootingTeamForwardsOnIce Number of forwards the shooting team has on the ice
shootingTeamDefencemenOnIce Number of defencemen the shooting team has on the ice
shootingTeamAverageTimeOnIce The average playing time in seconds the shooting team’s players have been on the ice
shootingTeamAverageTimeOnIceOfForwards The average playing time in seconds the shooting team’s forwards have been on the ice
shootingTeamAverageTimeOnIceOfDefencemen The average playing time in seconds the shooting team’s defencemen have been on the ice
shootingTeamMaxTimeOnIce The maximum playing time in seconds the shooting team’s players have been on the ice
shootingTeamMaxTimeOnIceOfForwards The maximum playing time in seconds the shooting team’s forwards have been on the ice
shootingTeamMaxTimeOnIceOfDefencemen The maximum playing time in seconds the shooting team’s defencemen have been on the ice
shootingTeamMinTimeOnIce The minimum playing time in seconds the shooting team’s players have been on the ice
shootingTeamMinTimeOnIceOfForwards The minimum playing time in seconds the shooting team’s forwards have been on the ice
shootingTeamMinTimeOnIceOfDefencemen The minimum playing time in seconds the shooting team’s defencemen have been on the ice
shootingTeamAverageTimeOnIceSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamAverageTimeOnIceOfForwardsSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamAverageTimeOnIceOfDefencemenSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamMaxTimeOnIceSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamMaxTimeOnIceOfForwardsSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamMaxTimeOnIceOfDefencemenSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamMinTimeOnIceSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamMinTimeOnIceOfForwardsSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
shootingTeamMinTimeOnIceOfDefencemenSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamForwardsOnIce Number of forwards the defending team has on the ice
defendingTeamDefencemenOnIce Number of defencemen the defending team has on the ice
defendingTeamAverageTimeOnIce The average playing time in seconds the defending team’s players have been on the ice
defendingTeamAverageTimeOnIceOfForwards The average playing time in seconds the defending team’s forwards have been on the ice
defendingTeamAverageTimeOnIceOfDefencemen The average playing time in seconds the defending team’s defencemen have been on the ice
defendingTeamMaxTimeOnIce The maximum playing time in seconds the defending team’s players have been on the ice
defendingTeamMaxTimeOnIceOfForwards The maximum playing time in seconds the defending team’s forwards have been on the ice
defendingTeamMaxTimeOnIceOfDefencemen The maximum playing time in seconds the defending team’s defencemen have been on the ice
defendingTeamMinTimeOnIce The minimum playing time in seconds the defending team’s players have been on the ice
defendingTeamMinTimeOnIceOfForwards The minimum playing time in seconds the defending team’s forwards have been on the ice
defendingTeamMinTimeOnIceOfDefencemen The minimum playing time in seconds the defending team’s defencemen have been on the ice
defendingTeamAverageTimeOnIceSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamAverageTimeOnIceOfForwardsSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamAverageTimeOnIceOfDefencemenSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamMaxTimeOnIceSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamMaxTimeOnIceOfForwardsSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamMaxTimeOnIceOfDefencemenSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamMinTimeOnIceSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamMinTimeOnIceOfForwardsSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
defendingTeamMinTimeOnIceOfDefencemenSinceFaceoff Same as equivalent variable above but only counting seconds since the faceoff if the player was on the ice before the faceoff
offWing Set to 1 if the shot is from the left side of the ice and the shooter is a right shot, or vice-versa. Otherwise 0
arenaAdjustedShotDistance The shot distance adjusted for arena recording bias. Uses the same methodology as War On Ice proposed by Schuckers and Curro. blog.war-on-ice.com/
arenaAdjustedXCord The x coordinate of the arena adjusted shot location. Always a positive number
arenaAdjustedYCord The y coordinate of the arena adjusted shot location
arenaAdjustedYCordAbs The absolute value of the arena adjusted y coordinate
timeDifferenceSinceChange The shooting team’s minimum time on ice of any player minus the defending team’s minimum time on ice of any player
averageRestDifference The shooting team’s average time on ice since a faceoff minus the defending team’s average time on ice since a faceoff
xGoal The probability the shot will be a goal. Also known as “Expected Goals”
xFroze The probability the goalie will freeze the puck and there will be a stoppage of play within 1 second of the shot
xRebound The probability there will be another shot within 3 seconds of this shot occurring
xPlayContinuedInZone The probability that the play will continue in the zone after the shot. Defined as the next event after the shot also occurring in the offensive zone and no player changes occurring. Does not include the xRebound probability
xPlayContinuedOutsideZone The probability that the play leaves the zone after the shot.
xPlayStopped The probability the play stops after the shot for a reason other than a goal or goalie freezing the puck. For example, the puck is shot into the netting or the net is dislodged, etc.
xShotWasOnGoal The probability the shot was on net. (Either a goal or a goalie save)
isHomeTeam Set to 1 if the shooting team is the home team
shotWasOnGoal Set to 1 if the shot was on net. (Either a goal or a goalie save)
teamCode The team code of the shooting team. For example, TOR, NYR, etc
arenaAdjustedXCordABS Absolute value of the arenaAdjustedXCord

Notes:  
    
If there was an empty net the goalie name will be blank 
The model scores for xGoal, xFroze, xRebound, xPlayContinuedInZone, xPlayContinuedOutsideZone, and xPlayStopped will sum up to 1    
If time on ice variables are not available, they are set to 999 for the 'minimum' variables and 0 for the 'maximum' variables. This occurs for a few shots per season on average, mostly in 2007 and 2008.
The shot distance adjustment algorithm proposed by Schuckers and Curro used in this dataset is explained here: http://www.sloansportsconference.com/wp-content/uploads/2013/Total%20Hockey%20Rating%20(THoR)%20A%20comprehensive%20statistical%20rating%20of%20National%20Hockey%20League%20forwards%20and%20defensemen%20based%20upon%20all%20on-ice%20events.pdf
The data has been collected from several sources including the NHL and ESPN 
No guarantees are made to the quality of the data. NHL shot data is known to have issues and biases.    
Please reach out through MoneyPuck.com if you have any feedback 
You are welcome to use this data in any work. Just please cite MoneyPuck.com
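Since the notes say unavailable time on ice values are coded as 999 for the 'minimum' variables and 0 for the 'maximum' variables, one cautious way to handle them is to flag a shot only when both sentinels appear together and then recode those values to NA. This is a sketch on made-up rows, not a step from the original analysis, and the both-sentinels rule is my own assumption, chosen because a maximum of 0 on its own could be a genuine value.

```r
# Toy rows with made-up values; the second row carries both sentinel codes
toi_toy <- data.frame(
  defendingTeamMinTimeOnIce = c(12, 999, 5),
  defendingTeamMaxTimeOnIce = c(48, 0, 61)
)

# Flag a shot as missing only when the minimum is 999 AND the maximum is 0
missing_toi <- toi_toy$defendingTeamMinTimeOnIce == 999 &
  toi_toy$defendingTeamMaxTimeOnIce == 0

# Recode the sentinel values to NA so they cannot distort means or regressions
toi_toy$defendingTeamMinTimeOnIce[missing_toi] <- NA
toi_toy$defendingTeamMaxTimeOnIce[missing_toi] <- NA

print(toi_toy)
```

After recoding, functions such as mean() need na.rm = TRUE to skip the flagged shots.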

1. Importing the Data

Here we import the Moneypuck datasets and examine the initial variables. After merging the two datasets, we have 238,304 shots in total from the 2022-2023 and 2023-2024 seasons. I was looking forward to using all of the data available, but we will have to settle for just these two seasons, as the dataset for all seasons is too large.

# Show code in the rendered output
knitr::opts_chunk$set(echo = TRUE)
# Clear the workspace and set the working directory
rm(list = ls())
setwd("/Users/connorraney/Desktop/desktop/QTM3605/scripts/Final Project")

# Import 2022-2023 data
shots2022 <- read.csv("shots_2022.csv") 

# Import 2023-2024 shot data
shots2023 <- read.csv("shots_2023.csv")

# Now, we need to merge these two together into one big dataset. We can do this by using the rbind() function.
shots <- rbind(shots2022, shots2023)
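Stacking the two seasons with rbind() only works when both files have the same columns, so a check like the one below can catch a mismatch early. This is a sketch on two tiny made-up data frames (season_a and season_b are hypothetical stand-ins for the two season files), not part of the original import.

```r
# Two toy season files with the same columns (made-up values)
season_a <- data.frame(shotID = 1:2, xGoal = c(0.10, 0.20), season = 2022)
season_b <- data.frame(shotID = 1:3, xGoal = c(0.05, 0.30, 0.15), season = 2023)

# Stop early if the two seasons do not share exactly the same columns
stopifnot(setequal(names(season_a), names(season_b)))

# Stack the rows, putting season_b's columns in season_a's order
combined <- rbind(season_a, season_b[names(season_a)])

# The combined row count should equal the sum of the two inputs
rows_match <- nrow(combined) == nrow(season_a) + nrow(season_b)
print(rows_match)

# Shots per season in the combined data
print(table(combined$season))
```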

2. Evaluating the Variables & Cleaning the Data for analysis

We have a few things we need to figure out here. First, the data dictionary shows that the ice time variables include defendingTeamAverageTimeOnIce, defendingTeamMaxTimeOnIce, etc. The most useful variable for our investigation is defendingTeamAverageTimeOnIce, but several other variables can give us insight into ice time patterns and how they relate to outcomes:
- defendingTeamAverageTimeOnIce: our main measure, the average playing time in seconds the defending team’s skaters have been on the ice at the time of the shot
- defendingTeamMaxTimeOnIce: the maximum playing time in seconds the defending team’s players have been on the ice
- defendingTeamAverageTimeOnIceOfForwards: the average playing time in seconds the defending team’s forwards have been on the ice
- defendingTeamAverageTimeOnIceOfDefencemen: the average playing time in seconds the defending team’s defencemen have been on the ice
- defendingTeamMaxTimeOnIceOfForwards: the maximum playing time in seconds the defending team’s forwards have been on the ice
- defendingTeamMaxTimeOnIceOfDefencemen: the maximum playing time in seconds the defending team’s defencemen have been on the ice

We also have to consider the situation we are measuring and the outcomes we expect. For example, a team’s average time on ice will most likely differ between even strength (5v5), a power play, and an empty net situation. So, we will first focus on 5v5 outcomes and then expand to other situations, like power plays (5v4). For this reason, we break the situations into:
- 5v5 (even strength): shots where both the shooting team and the defending team have five skaters on the ice
- All others: any situation where at least one team does not have five skaters on the ice. We group these together for simplicity’s sake, as going through every possible situation would give us too many groups to analyze productively in this project.
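To see how many distinct situations get lumped together, the skater counts can be turned into a strength label such as "5v4". The sketch below uses made-up rows; the strength column is a hypothetical helper of mine and is not a column in the Moneypuck data.

```r
# Toy shots with made-up skater counts; column names follow the data dictionary
strength_toy <- data.frame(
  shootingTeamForwardsOnIce    = c(3, 3, 2, 4),
  shootingTeamDefencemenOnIce  = c(2, 2, 2, 2),
  defendingTeamForwardsOnIce   = c(3, 2, 3, 3),
  defendingTeamDefencemenOnIce = c(2, 2, 2, 2)
)

# Skaters per side = forwards + defencemen (goalies are not counted)
shooting_skaters  <- strength_toy$shootingTeamForwardsOnIce +
  strength_toy$shootingTeamDefencemenOnIce
defending_skaters <- strength_toy$defendingTeamForwardsOnIce +
  strength_toy$defendingTeamDefencemenOnIce

# Hypothetical label: shooting team skaters, "v", defending team skaters
strength_toy$strength <- paste0(shooting_skaters, "v", defending_skaters)

# How many shots fall into each strength state
print(table(strength_toy$strength))
```

A table like this on the full data would show how small most of the non-5v5 groups are, which supports treating them as a single group here.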

## Load the dplyr library
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Create new variables for the total number of skaters on ice for both shooting and defending teams
shots <- shots %>%
  mutate(
    shootingTeamSkatersOnIce = shootingTeamForwardsOnIce + shootingTeamDefencemenOnIce,
    defendingTeamSkatersOnIce = defendingTeamForwardsOnIce + defendingTeamDefencemenOnIce
  )

# Check for any missing data before we create filters

# Check for missing data in the 'shootingTeamSkatersOnIce' and 'defendingTeamSkatersOnIce' columns
missing_shooting_skaters <- sum(is.na(shots$shootingTeamSkatersOnIce))
missing_defending_skaters <- sum(is.na(shots$defendingTeamSkatersOnIce))

# Print out the number of missing values
print(paste("Missing values in shootingTeamSkatersOnIce:", missing_shooting_skaters))
[1] "Missing values in shootingTeamSkatersOnIce: 0"
print(paste("Missing values in defendingTeamSkatersOnIce:", missing_defending_skaters))
[1] "Missing values in defendingTeamSkatersOnIce: 0"
# The results come out as 0 for both, so we can then move onto filtering the data

# Filter for even strength (5v5)
even_strength_shots <- shots %>%
  filter(shootingTeamSkatersOnIce == 5 & defendingTeamSkatersOnIce == 5)

# Filter for non-even strength (not 5v5)
non_even_strength_shots <- shots %>%
  filter(shootingTeamSkatersOnIce != 5 | defendingTeamSkatersOnIce != 5)

# Count the number of shots in the main dataframe
total_shots <- nrow(shots)

# Count the number of shots in the even strength subset
even_strength_count <- nrow(even_strength_shots)

# Count the number of shots in the non-even strength subset
non_even_strength_count <- nrow(non_even_strength_shots)

# Print the counts to verify
print(paste("Total shots:", total_shots))
[1] "Total shots: 238304"
print(paste("Even strength shots:", even_strength_count))
[1] "Even strength shots: 185634"
print(paste("Non-even strength shots:", non_even_strength_count))
[1] "Non-even strength shots: 52670"
# Check if the sum of subsets equals the total number of shots
if (total_shots == even_strength_count + non_even_strength_count) {
  print("The counts match! The sum of even and non-even strength shots equals the total number of shots.")
} else {
  print("There is a discrepancy in the counts. Please check the data and filtering criteria.")
}
[1] "The counts match! The sum of even and non-even strength shots equals the total number of shots."
# Check what the mean ice time is for both shooting and defending teams & store for later use, for all situations, even strength, and non-even strength
# All situations
meanDefendingTeamIceTimeAllSituations <- mean(shots$defendingTeamAverageTimeOnIce)
meanShootingTeamIceTimeAllSituations <- mean(shots$shootingTeamAverageTimeOnIce)
# Even Strength
meanDefendingTeamIceTimeEvenStrength <- mean(even_strength_shots$defendingTeamAverageTimeOnIce)
meanShootingTeamIceTimeEvenStrength <- mean(even_strength_shots$shootingTeamAverageTimeOnIce)
# Non-Even Strength
meanDefendingTeamIceTimeNonEvenStrength <- mean(non_even_strength_shots$defendingTeamAverageTimeOnIce)
meanShootingTeamIceTimeNonEvenStrength <- mean(non_even_strength_shots$shootingTeamAverageTimeOnIce)


# Create a variable for the iceTimeDifference in the offensiveTeam - defendingTeam to measure how the difference in offensiveIceTime vs defensiveIceTime affects the amount of xGoals that are given up, and then we will store this for later use
# All situations
shots$shootingTeamIceTimeDifferenceAllSituations <- shots$shootingTeamAverageTimeOnIce - shots$defendingTeamAverageTimeOnIce
# Even strength
even_strength_shots$shootingTeamIceDifferenceEvenStrength <- even_strength_shots$shootingTeamAverageTimeOnIce - even_strength_shots$defendingTeamAverageTimeOnIce
# Non-Even Strength
non_even_strength_shots$shootingTeamIceDifferenceNonEvenStrength <- non_even_strength_shots$shootingTeamAverageTimeOnIce - non_even_strength_shots$defendingTeamAverageTimeOnIce


# Save the means for later use
# All situations
averageShootingTeamIceTimeDifferenceAllSituations <- meanShootingTeamIceTimeAllSituations - meanDefendingTeamIceTimeAllSituations
# Even Strength
averageShootingTeamIceTimeDifferenceEvenStrength <- meanShootingTeamIceTimeEvenStrength - meanDefendingTeamIceTimeEvenStrength
# Non-Even Strength
averageShootingTeamIceTimeDifferenceNonEvenStrength <- meanShootingTeamIceTimeNonEvenStrength - meanDefendingTeamIceTimeNonEvenStrength


# Change shotType to factor for later use in PSM
shots$shotType <- as.factor(shots$shotType)

# Change yCordAdjusted to absolute value (make all values positive)
shots$yCordAdjusted <- abs(shots$yCordAdjusted)

# Evaluating our main variables
# Descriptive Statistics for the variables we will use for PSM

# Define the variables to run summary on
variables <- c("xCordAdjusted", "yCordAdjusted", "shotRebound", "speedFromLastEvent", "shotAngleAdjusted", "xGoal")

# Apply the summary function to each of the selected variables in the shots dataframe
summary_results <- lapply(shots[variables], summary)

# Print the summary results for each variable
print(summary_results)
$xCordAdjusted
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   49.00   65.00   61.88   78.00  100.00 

$yCordAdjusted
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    6.00   14.00   15.84   25.00   43.00 

$shotRebound
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.00000 0.00000 0.07156 0.00000 1.00000 

$speedFromLastEvent
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.113   4.800   8.467  10.617 193.747 

$shotAngleAdjusted
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   16.39   30.96   33.03   46.40   88.45 

$xGoal
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001537 0.014789 0.036877 0.072713 0.087609 0.976651 

3. Visualizing our data

# Load the ggplot2 library
library(ggplot2)
# defendingTeamAverageTimeOnIce vs xGoal - linear regression
ggplot(shots, aes(x=defendingTeamAverageTimeOnIce, y=xGoal)) +
  geom_point() +
  geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

# Same but with logistic regression
ggplot(shots, aes(x=defendingTeamAverageTimeOnIce, y=xGoal)) +
  geom_point(alpha=0.4) +
  geom_smooth(method="glm", method.args=list(family=binomial(link="logit")), se=TRUE, color="blue")
`geom_smooth()` using formula = 'y ~ x'
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
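The warning above appears because the binomial family expects counts of successes, while xGoal is a probability between 0 and 1. One common alternative for a fractional outcome is the quasibinomial family, which fits the same logit curve without that warning. The sketch below uses simulated data, so the variables toi and p and the coefficients in the simulation are made up and are not estimates from the shot data.

```r
set.seed(1)

# Simulated data: time on ice in seconds and a probability-like outcome
n   <- 500
toi <- runif(n, min = 0, max = 90)
p   <- plogis(-2.6 + 0.004 * toi + rnorm(n, sd = 0.5))

# Logit-link fit for a fractional outcome using the quasibinomial family
fit <- glm(p ~ toi, family = quasibinomial(link = "logit"))

# Estimated intercept and slope on the log-odds scale
print(coef(fit))

# Predicted outcome at 20 and 60 seconds of ice time, on the probability scale
pred <- predict(fit, newdata = data.frame(toi = c(20, 60)), type = "response")
print(pred)
```

In the plot, the equivalent change would be passing method.args = list(family = quasibinomial(link = "logit")) to geom_smooth().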

4. Running Linear Regressions to analyze our data

Here, we are going to run 3 groups of regressions:
1. Defending Ice Time
2. Shooting Ice Time
3. Offensive Ice Time - Defensive Ice Time (Ice Time Difference)

Each group is run on all situations, then on even strength shots, and then on non-even strength shots, giving us 9 regressions in total with a good mix of situations to investigate. We can then look at the results and estimate the significance and the effect of ice time on expected goals. We will define the models in a list and then display a summary of each model in one table so the results are easy to read.

# Load necessary libraries
library(dplyr)
library(modelsummary)

# Define the regression models
defendingTeamAllSituations <- lm(xGoal ~ defendingTeamAverageTimeOnIce, data = shots)
defendingTeamEvenStrength <- lm(xGoal ~ defendingTeamAverageTimeOnIce, data = even_strength_shots)
defendingTeamNonEvenStrength <- lm(xGoal ~ defendingTeamAverageTimeOnIce, data = non_even_strength_shots)

shootingTeamAllSituations <- lm(xGoal ~ shootingTeamAverageTimeOnIce, data = shots)
shootingTeamEvenStrength <- lm(xGoal ~ shootingTeamAverageTimeOnIce, data = even_strength_shots)
shootingTeamNonEvenStrength <- lm(xGoal ~ shootingTeamAverageTimeOnIce, data = non_even_strength_shots)

iceTimeDifferenceAllSituations <- lm(xGoal ~ shootingTeamIceTimeDifferenceAllSituations, data = shots)
iceTimeDifferenceEvenStrength <- lm(xGoal ~ shootingTeamIceDifferenceEvenStrength, data = even_strength_shots)
iceTimeDifferenceNonEvenStrength <- lm(xGoal ~ shootingTeamIceDifferenceNonEvenStrength, data = non_even_strength_shots)

# Combine models into a list
models <- list(
  DefendingTeamAllSituations = defendingTeamAllSituations,
  DefendingTeamEvenStrength = defendingTeamEvenStrength,
  DefendingTeamNonEvenStrength = defendingTeamNonEvenStrength,
  ShootingTeamAllSituations = shootingTeamAllSituations,
  ShootingTeamEvenStrength = shootingTeamEvenStrength,
  ShootingTeamNonEvenStrength = shootingTeamNonEvenStrength,
  IceTimeDifferenceAllSituations = iceTimeDifferenceAllSituations,
  IceTimeDifferenceEvenStrength = iceTimeDifferenceEvenStrength,
  IceTimeDifferenceNonEvenStrength = iceTimeDifferenceNonEvenStrength
)

# Use modelsummary to create a summary table of the models
modelsummary(models, stars = TRUE, 
             model_names = c(
               "Defending Team All Situations", 
               "Defending Team Even Strength", 
               "Defending Team Non-Even Strength", 
               "Shooting Team All Situations", 
               "Shooting Team Even Strength", 
               "Shooting Team Non-Even Strength", 
               "Ice Time Difference All Situations", 
               "Ice Time Difference Even Strength", 
               "Ice Time Difference Non-Even Strength"
             ), 
             fmt = "%.5f") # use fmt to set the decimal places to 5 as digits was not working
| Model | Predictor | (Intercept) | Coefficient | Num.Obs. | R2 | R2 Adj. | AIC | BIC | Log.Lik. | RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| DefendingTeamAllSituations | defendingTeamAverageTimeOnIce | 0.05298*** (0.00045) | 0.00057*** (0.00001) | 238304 | 0.010 | 0.010 | -412081.5 | -412050.3 | 206043.734 | 0.10 |
| DefendingTeamEvenStrength | defendingTeamAverageTimeOnIce | 0.05517*** (0.00037) | 0.00021*** (0.00001) | 185634 | 0.002 | 0.002 | -447117.1 | -447086.7 | 223561.535 | 0.07 |
| DefendingTeamNonEvenStrength | defendingTeamAverageTimeOnIce | 0.07419*** (0.00144) | 0.00092*** (0.00003) | 52670 | 0.015 | 0.015 | -41313.5 | -41286.9 | 20659.770 | 0.16 |
| ShootingTeamAllSituations | shootingTeamAverageTimeOnIce | 0.05288*** (0.00041) | 0.00063*** (0.00001) | 238304 | 0.013 | 0.013 | -412799.5 | -412768.3 | 206402.739 | 0.10 |
| ShootingTeamEvenStrength | shootingTeamAverageTimeOnIce | 0.05521*** (0.00039) | 0.00025*** (0.00001) | 185634 | 0.002 | 0.002 | -447068.4 | -447038.0 | 223537.195 | 0.07 |
| ShootingTeamNonEvenStrength | shootingTeamAverageTimeOnIce | 0.09597*** (0.00136) | 0.00031*** (0.00003) | 52670 | 0.003 | 0.003 | -40655.7 | -40629.1 | 20330.846 | 0.16 |
| IceTimeDifferenceAllSituations | shootingTeamIceTimeDifferenceAllSituations | 0.07298*** (0.00021) | 0.00009*** (0.00001) | 238304 | 0.000 | 0.000 | -409664.8 | -409633.7 | 204835.400 | 0.10 |
| IceTimeDifferenceEvenStrength | shootingTeamIceDifferenceEvenStrength | 0.06188*** (0.00018) | -0.00006*** (0.00001) | 185634 | 0.000 | 0.000 | -446703.6 | -446673.2 | 223354.783 | 0.07 |
| IceTimeDifferenceNonEvenStrength | shootingTeamIceDifferenceNonEvenStrength | 0.11158*** (0.00073) | -0.00032*** (0.00003) | 52670 | 0.002 | 0.002 | -40644.3 | -40617.7 | 20325.166 | 0.16 |

Standard errors in parentheses. + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
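
To get a feel for what these slopes mean in practice, the short calculation below scales one coefficient by a change in ice time and compares the result with the mean xGoal from the summary output. It is only a back-of-the-envelope sketch: the 10-unit change is an arbitrary illustrative choice, and treating the ice time variable as measured in seconds is an assumption, not something the output above confirms.

```r
# Values copied from the regression table and the xGoal summary above
slope_def_all <- 0.00057   # DefendingTeamAllSituations slope
mean_xgoal    <- 0.072713  # mean xGoal per shot

# Assumed, illustrative change in defending ice time (taken to be seconds)
extra_time <- 10

# Predicted change in xGoal, in absolute terms and relative to the mean
delta_xg   <- slope_def_all * extra_time
rel_change <- delta_xg / mean_xgoal

round(c(delta_xg = delta_xg, rel_change = rel_change), 4)
```

Under these assumptions, ten extra units of defending ice time move the predicted xGoal by well under one hundredth, which fits the very low R2 values in the table.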

5. Running Logistic Regressions to analyze our data

# Load necessary libraries
library(dplyr)
library(modelsummary)

# Define the logistic regression models
defendingTeamAllSituations <- glm(xGoal ~ defendingTeamAverageTimeOnIce, data = shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
defendingTeamEvenStrength <- glm(xGoal ~ defendingTeamAverageTimeOnIce, data = even_strength_shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
defendingTeamNonEvenStrength <- glm(xGoal ~ defendingTeamAverageTimeOnIce, data = non_even_strength_shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
shootingTeamAllSituations <- glm(xGoal ~ shootingTeamAverageTimeOnIce, data = shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
shootingTeamEvenStrength <- glm(xGoal ~ shootingTeamAverageTimeOnIce, data = even_strength_shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
shootingTeamNonEvenStrength <- glm(xGoal ~ shootingTeamAverageTimeOnIce, data = non_even_strength_shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
iceTimeDifferenceAllSituations <- glm(xGoal ~ shootingTeamIceTimeDifferenceAllSituations, data = shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
iceTimeDifferenceEvenStrength <- glm(xGoal ~ shootingTeamIceDifferenceEvenStrength, data = even_strength_shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
iceTimeDifferenceNonEvenStrength <- glm(xGoal ~ shootingTeamIceDifferenceNonEvenStrength, data = non_even_strength_shots, family = binomial)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
# Combine models into a list
models <- list(
    DefendingTeamAllSituations = defendingTeamAllSituations,
    DefendingTeamEvenStrength = defendingTeamEvenStrength,
    DefendingTeamNonEvenStrength = defendingTeamNonEvenStrength,
    ShootingTeamAllSituations = shootingTeamAllSituations,
    ShootingTeamEvenStrength = shootingTeamEvenStrength,
    ShootingTeamNonEvenStrength = shootingTeamNonEvenStrength,
    IceTimeDifferenceAllSituations = iceTimeDifferenceAllSituations,
    IceTimeDifferenceEvenStrength = iceTimeDifferenceEvenStrength,
    IceTimeDifferenceNonEvenStrength = iceTimeDifferenceNonEvenStrength
)

# Use modelsummary to create a summary table of the models
modelsummary(models,
    stars = TRUE,
    model_names = c(
        "Defending Team All Situations",
        "Defending Team Even Strength",
        "Defending Team Non-Even Strength",
        "Shooting Team All Situations",
        "Shooting Team Even Strength",
        "Shooting Team Non-Even Strength",
        "Ice Time Difference All Situations",
        "Ice Time Difference Even Strength",
        "Ice Time Difference Non-Even Strength"
    ),
    fmt = "%.5f"
) # use fmt to set the decimal places to 5 as digits was not working
| Model | Predictor | (Intercept) | Coefficient | Num.Obs. | AIC | BIC | Log.Lik. | RMSE |
|---|---|---|---|---|---|---|---|---|
| DefendingTeamAllSituations | defendingTeamAverageTimeOnIce | -2.82896*** (0.01689) | 0.00798*** (0.00041) | 238304 | 50163.6 | 50184.4 | -25079.799 | 0.10 |
| DefendingTeamEvenStrength | defendingTeamAverageTimeOnIce | -2.83252*** (0.02136) | 0.00356*** (0.00056) | 185634 | 24890.3 | 24910.5 | -12443.136 | 0.07 |
| DefendingTeamNonEvenStrength | defendingTeamAverageTimeOnIce | -2.44598*** (0.02853) | 0.00875*** (0.00059) | 52670 | 23200.4 | 23218.1 | -11598.189 | 0.16 |
| ShootingTeamAllSituations | shootingTeamAverageTimeOnIce | -2.81670*** (0.01498) | 0.00827*** (0.00037) | 238304 | 49930.5 | 49951.3 | -24963.269 | 0.10 |
| ShootingTeamEvenStrength | shootingTeamAverageTimeOnIce | -2.83158*** (0.02220) | 0.00421*** (0.00070) | 185634 | 24889.7 | 24910.0 | -12442.846 | 0.07 |
| ShootingTeamNonEvenStrength | shootingTeamAverageTimeOnIce | -2.23244*** (0.02663) | 0.00307*** (0.00049) | 52670 | 23420.6 | 23438.3 | -11708.283 | 0.16 |
| IceTimeDifferenceAllSituations | shootingTeamIceTimeDifferenceAllSituations | -2.54207*** (0.00796) | 0.00139*** (0.00042) | 238304 | 50590.9 | 50611.6 | -25293.441 | 0.10 |
| IceTimeDifferenceEvenStrength | shootingTeamIceDifferenceEvenStrength | -2.71881*** (0.01020) | -0.00106+ (0.00062) | 185634 | 24890.9 | 24911.2 | -12443.454 | 0.07 |
| IceTimeDifferenceNonEvenStrength | shootingTeamIceDifferenceNonEvenStrength | -2.07710*** (0.01413) | -0.00330*** (0.00055) | 52670 | 23449.6 | 23467.3 | -11722.780 | 0.16 |

Standard errors in parentheses. + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
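
The logistic coefficients are on the log-odds scale, so they are harder to read directly than the linear ones. The sketch below converts the DefendingTeamAllSituations estimates into an odds ratio and into predicted xGoal values at two ice times. The value 32.2 is the median defending ice time computed in the propensity score section; the +10 comparison point is an arbitrary illustrative choice.

```r
# Coefficients copied from the DefendingTeamAllSituations logistic model
b0 <- -2.82896   # intercept (log-odds)
b1 <- 0.00798    # slope per unit of defending ice time (log-odds)

# Multiplicative change in the odds for one extra unit of ice time
odds_ratio <- exp(b1)

# Predicted xGoal at the median ice time and at 10 units above it
p_at_median <- plogis(b0 + b1 * 32.2)
p_plus_10   <- plogis(b0 + b1 * 42.2)

round(c(odds_ratio = odds_ratio,
        p_at_median = p_at_median,
        p_plus_10 = p_plus_10), 4)
```

`plogis()` is the inverse logit in base R, so no extra packages are needed for the conversion.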

6. Running Propensity Score Matching to analyze our data

Before starting, we need to choose six baseline covariates for our analysis: variables other than ice time that may affect the xGoals value of a shot.

From Moneypuck, we can see these are the variables used in their shot prediction model to produce xGoals:

Variables In Shot Prediction Model:

1. Shot Distance From Net
2. Time Since Last Game Event
3. Shot Type (Slap, Wrist, Backhand, etc)
4. Speed From Previous Event
5. Shot Angle
6. East-West Location on Ice of Last Event Before the Shot
7. If Rebound, difference in shot angle divided by time since last shot
8. Last Event That Happened Before the Shot (Faceoff, Hit, etc)
9. Other team’s # of skaters on ice
10. East-West Location on Ice of Shot
11. Man Advantage Situation
12. Time since current Powerplay started
13. Distance From Previous Event
14. North-South Location on Ice of Shot
15. Shooting on Empty Net

This list, along with a fuller explanation of how the xGoals model is built and calculated, can be found at: https://moneypuck.com/about.htm

With this info, I am going to use:
1. Shot Location (x) -> xCordAdjusted (absolute value)
2. Shot Location (y) -> yCordAdjusted (absolute value)
3. Shot Type -> shotType
4. Shot Rebound -> shotRebound
5. Speed From Previous Event -> speedFromLastEvent
6. Shot Angle -> shotAngleAdjusted (for simplicity's sake we will use the absolute value)

At first I also included shotDistance, but distance is determined by the x and y coordinates, so these variables would most likely have been too closely related. I dropped distance and kept the two coordinates.

These are good to use as baselines to work off of for our propensity score matching analysis.

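For our treatment, we are going to split the shots at the median of the defending team's average time on ice: a shot counts as treated if that time is at or above the median. Let's first compute the median, along with the mean xGoals, which we will need later for reference: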

# Create the treatment variable for PSM
medianDefendingTime <- median(shots$defendingTeamAverageTimeOnIce)
print(medianDefendingTime)
[1] 32.2
meanxGoals <- mean(shots$xGoal)
print(meanxGoals)
[1] 0.07271334

Now, with our median value, we can create a binary treatment variable:

shots$treatment <- ifelse(shots$defendingTeamAverageTimeOnIce >= medianDefendingTime, 1, 0)

Now, we can continue with our propensity score matching analysis using the treatment variable we just created and our chosen baseline covariates.

# 1. Estimate Propensity Scores
    # Load necessary library for logistic regression
    library(stats)

    # Estimate the propensity score using logistic regression
    logistic_model <- glm(treatment ~ xCordAdjusted + yCordAdjusted + shotType + shotRebound + speedFromLastEvent + shotAngleAdjusted, family = binomial(), data = shots)

    # Make predictions and store as a propensity score in the dataset
    shots$propensity_score <- predict(logistic_model, type = "response")

# 2. Match Units Using PSM
    # Load necessary library for matching
    library(MatchIt)

    # Use MatchIt to match treatment/control on defined observable covariates
    match_model <- matchit(treatment ~ xCordAdjusted + yCordAdjusted + shotType + shotRebound + speedFromLastEvent + shotAngleAdjusted, method = "nearest", data = shots)
Warning: Fewer control units than treated units; not all treated units will get
a match.
    # Create a data frame with only matched observations
    matched <- match.data(match_model)


# 3. Check Baseline Covariates
    # Load the necessary library
    library(modelsummary)
   # Create a list to store models for each covariate
    models_matched <- list()

    # Add linear models for each covariate against the treatment variable
    models_matched[['xCordAdjusted']] <- lm(xCordAdjusted ~ treatment, data = matched)
    models_matched[['yCordAdjusted']] <- lm(yCordAdjusted ~ treatment, data = matched)
    models_matched[['shotType']] <- lm(shotType ~ treatment, data = matched)
Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
response will be ignored
Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
    models_matched[['shotRebound']] <- lm(shotRebound ~ treatment, data = matched)
    models_matched[['speedFromLastEvent']] <- lm(speedFromLastEvent ~ treatment, data = matched)
    models_matched[['shotAngleAdjusted']] <- lm(shotAngleAdjusted ~ treatment, data = matched)

    # Generate a summary table of the models
    modelsummary(models_matched, stars = TRUE, title = "Differences between treatment and control group, Matched Sample")
Warning in Ops.factor(weighted.residuals(object), 2): '^' not meaningful for
factors
Warning in Ops.factor(res, 2): '^' not meaningful for factors
Warning in Ops.factor(res, 2): '^' not meaningful for factors
Warning in Ops.factor(res, 2): '^' not meaningful for factors
Differences between treatment and control group, Matched Sample
| | xCordAdjusted | yCordAdjusted | shotType | shotRebound | speedFromLastEvent | shotAngleAdjusted |
|---|---|---|---|---|---|---|
| (Intercept) | 60.760*** (0.055) | 16.635*** (0.033) | 6.345 | 0.064*** (0.001) | 8.818*** (0.031) | 33.244*** (0.062) |
| treatment | 2.272*** (0.078) | -1.614*** (0.047) | -0.074 | 0.014*** (0.001) | -0.792*** (0.045) | -0.431*** (0.087) |
| Num.Obs. | 238158 | 238158 | 238158 | 238158 | 238158 | 238158 |
| R2 | 0.004 | 0.005 | | 0.001 | 0.001 | 0.000 |
| R2 Adj. | 0.004 | 0.005 | | 0.001 | 0.001 | 0.000 |
| AIC | 2079363.2 | 1839798.6 | | 30007.6 | 1812333.8 | 2133036.7 |
| BIC | 2079394.3 | 1839829.7 | | 30038.7 | 1812364.9 | 2133067.8 |
| Log.Lik. | -1039678.588 | -919896.297 | | -15000.788 | -906163.898 | -1066515.335 |
| RMSE | 19.04 | 11.51 | | 0.26 | 10.87 | 21.31 |

Standard errors in parentheses. + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
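
Two things stand out in this balance check. With more than 200,000 matched shots, almost any difference is statistically significant, so balance is often judged by standardized mean differences instead, with values under about 0.1 as a common rule of thumb. In addition, shotType is a factor, so lm() cannot use it as a response (hence the warnings and the missing entries in that column). The sketch below shows one possible alternative on simulated data; `sim_matched`, `xCord` and the `smd()` helper are hypothetical names made up for illustration, not part of the analysis above.

```r
set.seed(42)
n <- 400

# Simulated stand-in for a matched sample
sim_matched <- data.frame(
  treatment = rep(c(0, 1), each = n / 2),
  shotType  = factor(sample(c("WRIST", "SLAP", "SNAP"), n, replace = TRUE))
)
sim_matched$xCord <- rnorm(n, mean = 60 + 2 * sim_matched$treatment, sd = 19)

# Hypothetical helper: difference in group means divided by the pooled SD
smd <- function(x, treat) {
  m1 <- mean(x[treat == 1])
  m0 <- mean(x[treat == 0])
  pooled_sd <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (m1 - m0) / pooled_sd
}

# Numeric covariate: standardized mean difference
smd_xcord <- smd(sim_matched$xCord, sim_matched$treatment)

# Categorical covariate: compare the distribution across groups
type_test <- chisq.test(table(sim_matched$shotType, sim_matched$treatment))

smd_xcord
type_test$p.value
```

MatchIt also reports standardized mean differences through `summary()` on the object returned by `matchit()`, which may be the simpler route for the real data.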
# 4. Estimate Effects Using PSM Sample
    # Load the necessary library
    library(modelsummary)

    # Create a list to store the models
    models_effects <- list()

    # Model with only the treatment effect on xGoal
    models_effects[['Effects']] <- lm(xGoal ~ treatment, data = matched)

    # Model with treatment effect and controls on xGoal
    models_effects[['Effects + Controls']] <- lm(xGoal ~ treatment + xCordAdjusted + yCordAdjusted + shotType + shotRebound + speedFromLastEvent + shotAngleAdjusted, data = matched)

    # Generate a summary table of the models
    modelsummary(models_effects, stars = TRUE, title = "Effects w/PSM")
Effects w/PSM
| | Effects | Effects + Controls |
|---|---|---|
| (Intercept) | 0.067*** (0.000) | -0.005* (0.002) |
| treatment | 0.012*** (0.000) | 0.004*** (0.000) |
| xCordAdjusted | | 0.002*** (0.000) |
| yCordAdjusted | | -0.001*** (0.000) |
| shotTypeBACK | | -0.023*** (0.002) |
| shotTypeDEFL | | -0.012*** (0.002) |
| shotTypeSLAP | | -0.012*** (0.002) |
| shotTypeSNAP | | -0.009*** (0.002) |
| shotTypeTIP | | -0.018*** (0.002) |
| shotTypeWRAP | | -0.037*** (0.003) |
| shotTypeWRIST | | -0.014*** (0.002) |
| shotRebound | | 0.110*** (0.001) |
| speedFromLastEvent | | 0.001*** (0.000) |
| shotAngleAdjusted | | -0.001*** (0.000) |
| Num.Obs. | 238158 | 238158 |
| R2 | 0.004 | 0.338 |
| R2 Adj. | 0.004 | 0.338 |
| AIC | -410444.0 | -507809.4 |
| BIC | -410412.9 | -507653.7 |
| Log.Lik. | 205225.005 | 253919.711 |
| RMSE | 0.10 | 0.08 |

Standard errors in parentheses. + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

After running this, we can see that the treatment effect is statistically significant both with and without the controls. Without controls, the coefficient of 0.012 is approximately 16.5% of the mean xGoals per shot (0.0727), which is a sizable difference. With the controls included, the coefficient shrinks to 0.004, or roughly 5.5% of the mean, so part of the raw gap is accounted for by where and how the shots are taken. Either way, shots taken when the defending team's average time on ice is at or above the median carry a higher xGoals value, meaning a higher estimated likelihood of scoring.
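
The percentages in this interpretation come from dividing each treatment coefficient by the mean xGoals per shot. As a quick check, using the rounded coefficients from the Effects w/PSM table and the mean printed earlier:

```r
# Mean xGoal per shot, as printed when the treatment variable was created
mean_xgoal <- 0.07271334

# Treatment coefficients from the "Effects w/PSM" table (rounded to 3 decimals)
effects <- c(no_controls = 0.012, with_controls = 0.004)

# Each effect expressed as a share of the mean xGoal
rel_effect <- effects / mean_xgoal

round(100 * rel_effect, 1)  # percentages of the mean
```

Because the table rounds the coefficients to three decimals, these percentages are approximate.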